ICD-10 code
H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis
Lim, Seungseop, Kim, Gibaeg, Lee, Hyunkyung, Han, Wooseok, Seo, Jean, Yoo, Jaehyo, Yang, Eunho
An accurate differential diagnosis (DDx) is essential for patient care, shaping therapeutic decisions and influencing outcomes. Recently, Large Language Models (LLMs) have emerged as promising tools to support this process by generating a DDx list from patient narratives. However, existing evaluations of LLMs in this domain primarily rely on flat metrics, such as Top-k accuracy, which fail to distinguish between clinically relevant near-misses and diagnostically distant errors. To mitigate this limitation, we introduce H-DDx, a hierarchical evaluation framework that better reflects clinical relevance. H-DDx leverages a retrieval and reranking pipeline to map free-text diagnoses to ICD-10 codes and applies a hierarchical metric that credits predictions closely related to the ground-truth diagnosis. In benchmarking 22 leading models, we show that conventional flat metrics underestimate performance by overlooking clinically meaningful outputs, with our results highlighting the strengths of domain-specialized open-source models. Furthermore, our framework enhances interpretability by revealing hierarchical error patterns, demonstrating that LLMs often correctly identify the broader clinical context even when the precise diagnosis is missed.
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
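The hierarchical crediting idea behind H-DDx can be sketched in a few lines: a predicted diagnosis earns partial credit proportional to the depth of the ICD-10 prefix it shares with the ground truth. The prefix-depth weighting below is an illustrative assumption, not the paper's exact metric.

```python
# Hypothetical sketch of a hierarchical credit metric in the spirit of H-DDx:
# a prediction is scored by how deep into the ICD-10 hierarchy it agrees with
# the ground-truth code. The linear weighting is an assumption for illustration.

def shared_prefix_depth(pred: str, truth: str) -> int:
    """Number of leading characters two ICD-10 codes share (dot ignored)."""
    p, t = pred.replace(".", ""), truth.replace(".", "")
    depth = 0
    for a, b in zip(p, t):
        if a != b:
            break
        depth += 1
    return depth

def hierarchical_credit(pred: str, truth: str) -> float:
    """1.0 for an exact match, partial credit for a near-miss in the same
    ICD-10 branch, 0.0 for a diagnostically distant code."""
    t = truth.replace(".", "")
    return shared_prefix_depth(pred, truth) / len(t)

# An exact match scores 1.0; a sibling code in the same category still
# receives substantial credit, unlike flat Top-k accuracy.
print(hierarchical_credit("I21.0", "I21.0"))  # 1.0
print(hierarchical_credit("I21.4", "I21.0"))  # 0.75 (shares "I21")
print(hierarchical_credit("J45.0", "I21.0"))  # 0.0
```

This is what lets the metric distinguish a clinically relevant near-miss (same ICD-10 category) from a distant error, which Top-k accuracy scores identically.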
On Using Large Language Models to Enhance Clinically-Driven Missing Data Recovery Algorithms in Electronic Health Records
Lotspeich, Sarah C., Collins, Abbey, Wells, Brian J., Khanna, Ashish K., Rigdon, Joseph, McGowan, Lucy D'Agostino
Objective: Electronic health records (EHR) data are prone to missingness and errors. Previously, we devised an "enriched" chart review protocol where a "roadmap" of auxiliary diagnoses (anchors) was used to recover missing values in EHR data (e.g., a diagnosis of impaired glycemic control might imply that a missing hemoglobin A1c value would be considered unhealthy). Still, chart reviews are expensive and time-intensive, which limits the number of patients whose data can be reviewed. Now, we investigate the accuracy and scalability of a roadmap-driven algorithm, based on ICD-10 codes (International Classification of Diseases, 10th revision), to mimic expert chart reviews and recover missing values. Materials and Methods: In addition to the clinicians' original roadmap from our previous work, we consider new versions that were iteratively refined using large language models (LLM) in conjunction with clinical expertise to expand the list of auxiliary diagnoses. Using chart reviews for 100 patients from the EHR at an extensive learning health system, we examine algorithm performance with different roadmaps. In the larger study of 1,000 patients, we applied the final algorithm, which used a roadmap with clinician-approved additions from the LLM. Results: Depending on the roadmap, the algorithm recovered as much missing data as the expert chart reviewers, if not more. Discussion: Clinically-driven algorithms (enhanced by LLM) can recover missing EHR data with accuracy similar to chart reviews and can feasibly be applied to large samples. Extending them to monitor other dimensions of data quality (e.g., plausibility) is a promising future direction.
- North America > United States > North Carolina > Forsyth County > Winston-Salem (0.14)
- South America (0.14)
- North America > United States > Texas > Harris County > Houston (0.04)
- (9 more...)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.93)
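A roadmap-driven recovery rule of the kind described above can be sketched as a lookup from anchor ICD-10 prefixes to the category a missing value should take. The roadmap entries and record layout here are illustrative assumptions, not the authors' actual roadmap.

```python
# Minimal sketch of roadmap-driven recovery: if a lab value is missing, infer
# its category from auxiliary "anchor" diagnoses in the patient's ICD-10 codes.
# The anchor list and record fields below are hypothetical examples.

# Anchor diagnoses (ICD-10 prefixes) -> implied category for a missing A1c.
ROADMAP_A1C = {
    "E11": "unhealthy",  # type 2 diabetes implies impaired glycemic control
    "E10": "unhealthy",  # type 1 diabetes
    "R73": "unhealthy",  # elevated blood glucose
}

def recover_a1c(record: dict) -> dict:
    """If the hemoglobin A1c category is missing, fill it from anchor codes."""
    if record.get("a1c_category") is None:
        for code in record.get("icd10_codes", []):
            category = ROADMAP_A1C.get(code[:3])  # match on the 3-char prefix
            if category:
                record["a1c_category"] = category
                break
    return record

patient = {"icd10_codes": ["E11.9", "I10"], "a1c_category": None}
print(recover_a1c(patient)["a1c_category"])  # "unhealthy"
```

Because the rule is a pure code lookup, it scales to thousands of patients where manual chart review does not, which is the trade-off the study quantifies.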
Using LLMs for Multilingual Clinical Entity Linking to ICD-10
Vassileva, Sylvia, Koychev, Ivan, Boytcheva, Svetla
The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish (0.89 F1 on categories and 0.78 F1 on subcategories on CodiEsp) and Greek (0.85 F1 on ElCardioCC).
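The two-stage pipeline described above (dictionary lookup first, LLM fallback for unmatched terms) can be sketched as follows. The `ask_llm` function is a placeholder for an in-context-learning call (e.g., to GPT-4.1); the dictionary entries and interfaces are illustrative assumptions, not the authors' code.

```python
# Sketch of a multistage linking pipeline: exact dictionary match for
# unambiguous terms, LLM fallback otherwise. Dictionary contents are examples.

DICTIONARY = {
    "myocardial infarction": "I21",
    "asthma": "J45",
}

def ask_llm(term: str) -> str:
    """Placeholder for a few-shot LLM prediction of an ICD-10 code."""
    return "UNKNOWN"

def link_term(term: str) -> str:
    code = DICTIONARY.get(term.lower().strip())
    if code is not None:
        return code       # stage 1: unambiguous dictionary match
    return ask_llm(term)  # stage 2: fall back to in-context learning

print(link_term("Asthma"))           # "J45" via the dictionary
print(link_term("chest tightness"))  # falls through to the LLM stub
```

Routing only the ambiguous residue to the LLM keeps the frequent, unambiguous terms cheap and deterministic, which is presumably why the pipeline is staged this way.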
MedSynth: Realistic, Synthetic Medical Dialogue-Note Pairs
Mianroodi, Ahmad Rezaie, Rezaie, Amirali, Todorov, Niko Grisel, Rakovski, Cyril, Rudzicz, Frank
Physicians spend significant time documenting clinical encounters, a burden that contributes to professional burnout. To address this, robust automation tools for medical documentation are crucial. We introduce MedSynth -- a novel dataset of synthetic medical dialogues and notes designed to advance the Dialogue-to-Note (Dial-2-Note) and Note-to-Dialogue (Note-2-Dial) tasks. Informed by an extensive analysis of disease distributions, this dataset includes over 10,000 dialogue-note pairs covering over 2000 ICD-10 codes. We demonstrate that our dataset markedly enhances the performance of models in generating medical notes from dialogues, and dialogues from medical notes. The dataset provides a valuable resource in a field where open-access, privacy-compliant, and diverse training data are scarce. Code is available at https://github.com/ahmadrezarm/MedSynth/tree/main and the dataset is available at https://huggingface.co/datasets/Ahmad0067/MedSynth.
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Therapeutic Area > Musculoskeletal (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology (1.00)
- (10 more...)
MedSyn: Enhancing Diagnostics with Human-AI Collaboration
Sayin, Burcu, Schlicht, Ipek Baris, Hong, Ngoc Vo, Allievi, Sara, Staiano, Jacopo, Minervini, Pasquale, Passerini, Andrea
Clinical decision-making is inherently complex, often influenced by cognitive biases, incomplete information, and case ambiguity. Large Language Models (LLMs) have shown promise as tools for supporting clinical decision-making, yet their typical one-shot or limited-interaction usage may overlook the complexities of real-world medical practice. In this work, we propose a hybrid human-AI framework, MedSyn, where physicians and LLMs engage in multi-step, interactive dialogues to refine diagnoses and treatment decisions. Unlike static decision-support tools, MedSyn enables dynamic exchanges, allowing physicians to challenge LLM suggestions while the LLM highlights alternative perspectives. Through simulated physician-LLM interactions, we assess the potential of open-source LLMs as physician assistants; the results show they are promising in this role. Future work will involve real physician interactions to further validate MedSyn's usefulness in diagnostic accuracy and patient outcomes.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (10 more...)
Evaluating Hierarchical Clinical Document Classification Using Reasoning-Based LLMs
Mustafa, Akram, Naseem, Usman, Azghadi, Mostafa Rahimi
Background: Clinical coding, particularly the classification of hierarchical ICD-10 codes from unstructured discharge summaries, is essential for healthcare operations, but remains a labor-intensive and error-prone task. Automated approaches using Large Language Models (LLMs) offer the potential to augment or replace human coders, yet their reliability and reasoning capabilities, which are needed to ensure accurate, explainable code assignments, are not well understood. Objective: This study aims to benchmark a diverse set of LLMs, both reasoning and non-reasoning models, on their ability to classify hierarchical ICD-10 codes from discharge summaries and to evaluate the effect of structured reasoning on model performance. Methods: Using the MIMIC-IV dataset, the study selected 1,500 discharge summaries labeled with the top 10 most frequent ICD-10 codes, balancing dataset size with the high computational and financial cost of using LLMs. We first preprocessed the data to extract clinically relevant tokens before feeding it to the LLMs. Specifically, we used cTAKES, a clinical NLP tool, to identify medical concepts. Each summary was encoded and submitted to 11 LLMs using a standardized, structured prompt simulating a clinical coder. Models were evaluated using the F1 score across three ICD-10 levels for both the primary-diagnosis and all-diagnoses classification tasks. Results: Reasoning models on average outperformed non-reasoning models. The Gemini 2.5 Pro model demonstrated the highest performance across tasks.
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.68)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
- Health & Medicine > Health Care Technology > Medical Record (0.93)
- Health & Medicine > Health Care Providers & Services (0.68)
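Evaluating "across three ICD-10 levels," as the study above does, amounts to truncating predicted and gold codes to a given depth and scoring set overlap at each depth. Treating levels as truncation lengths is an assumption about the study's setup; the codes below are examples.

```python
# Sketch of level-wise scoring: truncate codes to a given number of
# characters (dot stripped) and compute set-based F1 at that level.

def truncate(codes, chars=None):
    """Strip the dot and keep the first `chars` characters (all if None)."""
    return {c.replace(".", "")[:chars] for c in codes}

def f1_at_level(pred, gold, chars=None):
    p, g = truncate(pred, chars), truncate(gold, chars)
    tp = len(p & g)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

pred, gold = ["I21.4", "J45.0"], ["I21.0", "J45.0"]
print(f1_at_level(pred, gold, chars=3))  # category level: both codes match
print(f1_at_level(pred, gold))           # full-code level: only J45.0 matches
```

Scoring at multiple depths is what separates a model that gets the right category but the wrong subcode from one that is wrong outright.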
Regularizing Log-Linear Cost Models for Inpatient Stays by Merging ICD-10 Codes
Lu, Chi-Ken, Alonge, David, Richardson, Nicole, Richard, Bruno
Cost models in healthcare research must balance interpretability, accuracy, and parameter consistency. However, interpretable models often struggle to achieve both accuracy and consistency. Ordinary least squares (OLS) models for high-dimensional regression can be accurate but fail to produce stable regression coefficients over time when using highly granular ICD-10 diagnostic codes as predictors. This instability arises because many ICD-10 codes are infrequent in healthcare datasets. While regularization methods such as Ridge can address this issue, they risk discarding important predictors. Here, we demonstrate that reducing the granularity of ICD-10 codes is an effective regularization strategy within OLS while preserving the representation of all diagnostic code categories. By truncating ICD-10 codes from seven characters (e.g., T67.0XXA, T67.0XXD) to six (e.g., T67.0XX) or fewer, we reduce the dimensionality of the regression problem while maintaining model interpretability and consistency. Mathematically, the merging of predictors in OLS leads to increased trace of the Hessian matrix, which reduces the variance of coefficient estimation. Our findings explain why broader diagnostic groupings like DRGs and HCC codes are favored over highly granular ICD-10 codes in real-world risk adjustment and cost models.
- North America > United States > New Jersey > Essex County > Newark (0.04)
- Asia > Thailand (0.04)
- North America > United States > New York > Richmond County > New York City (0.04)
- (3 more...)
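The merging strategy above is easy to see on toy data: truncating sibling codes collapses them into one predictor, shrinking the design matrix of the OLS problem. The example codes come from the abstract; the stay lists are made up for illustration.

```python
# Toy illustration of regularization by merging: truncating ICD-10 codes
# (e.g., T67.0XXA and T67.0XXD to T67.0XX) collapses rare sibling codes into
# a single indicator column, reducing the dimension of the regression.

stays = [
    ["T67.0XXA"],
    ["T67.0XXD"],
    ["T67.0XXA", "S72.001A"],
    ["S72.001D"],
]

def merge(code, chars=None):
    """Drop the dot and keep the first `chars` characters (all if None)."""
    return code.replace(".", "")[:chars]

def design_columns(stays, chars=None):
    """Distinct indicator columns after merging codes to `chars` characters."""
    return sorted({merge(code, chars) for stay in stays for code in stay})

print(len(design_columns(stays)))      # 4 columns at full 7-char granularity
print(design_columns(stays, chars=6))  # 2 columns after merging to 6 chars
```

Each merged column now aggregates the counts of its constituent rare codes, which is the mechanism behind the increased Hessian trace and lower coefficient variance described in the abstract.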
Searching Clinical Data Using Generative AI
Hanswadkar, Karan, Kanchi, Anika, Tripathi, Shivani, Qiao, Shi, Chatterjee, Rony, Jindal, Alekh
Artificial Intelligence (AI) is making a major impact on healthcare, particularly through its application in natural language processing (NLP) and predictive analytics. The healthcare sector has increasingly adopted AI for tasks such as clinical data analysis and medical code assignment. However, searching for clinical information in large and often unorganized datasets remains a manual and error-prone process. Automating this process can significantly improve physicians' operational productivity. In this paper, we present a generative AI approach, coined SearchAI, to enhance the accuracy and efficiency of searching clinical data. Unlike traditional code assignment, which is a one-to-one problem, clinical data search is a one-to-many problem, i.e., a given search query can map to a family of codes. Healthcare professionals typically search for groups of related diseases, drugs, or conditions that map to many codes, and therefore, they need search tools that can handle keyword synonyms, semantic variants, and broad open-ended queries. SearchAI employs a hierarchical model that respects the coding hierarchy and improves the traversal of relationships from parent to child nodes. SearchAI navigates these hierarchies predictively and ensures that all paths are reachable without losing any relevant nodes. To evaluate the effectiveness of SearchAI, we conducted a series of experiments using both public and production datasets. Our results show that SearchAI outperforms default hierarchical traversals across several metrics, including accuracy, robustness, performance, and scalability. SearchAI can help make clinical data more accessible, leading to streamlined workflows, reduced administrative burden, and enhanced coding and diagnostic accuracy.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > Oregon (0.04)
- Research Report > Experimental Study (0.97)
- Research Report > New Finding (0.87)
- Health & Medicine > Health Care Providers & Services (1.00)
- Health & Medicine > Diagnostic Medicine (1.00)
- Health & Medicine > Consumer Health (0.94)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.46)
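The one-to-many property described above means a query that resolves to a parent node must expand to every descendant code. A minimal sketch of such hierarchy expansion, with a tiny hypothetical code tree standing in for SearchAI's internals:

```python
# Sketch of one-to-many expansion over a coding hierarchy: a hit on a parent
# node surfaces the whole family of descendant codes. The tree is a toy example.

CHILDREN = {
    "E11":   ["E11.2", "E11.9"],    # type 2 diabetes and its subcodes
    "E11.2": ["E11.21", "E11.22"],  # diabetic kidney complications
}

def expand(code):
    """Return the code plus all reachable descendants (depth-first)."""
    found = [code]
    for child in CHILDREN.get(code, []):
        found.extend(expand(child))
    return found

# A query resolving to "E11" returns the full family of codes, not the
# single node a one-to-one code-assignment system would return.
print(expand("E11"))
```

Guaranteeing that every path is reachable, as the abstract requires, corresponds to this traversal visiting all descendants without pruning any branch.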
Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare
Kokash, Natallia, Wang, Lei, Gillespie, Thomas H., Belloum, Adam, Grosso, Paola, Quinney, Sara, Li, Lang, de Bono, Bernard
The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy integrating ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, demonstrating its effectiveness in a real-world project involving semantic mapping of EHR data.
- Europe > Netherlands > North Holland > Amsterdam (0.41)
- North America > United States > Ohio (0.04)
- North America > United States > Indiana (0.04)
- (5 more...)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Overview (0.93)
Can Reasoning LLMs Enhance Clinical Document Classification?
Mustafa, Akram, Naseem, Usman, Azghadi, Mostafa Rahimi
Clinical document classification is essential for converting unstructured medical texts into standardised ICD-10 diagnoses, yet it faces challenges due to complex medical language, privacy constraints, and limited annotated datasets. Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs, four reasoning (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat), in classifying clinical discharge summaries using the MIMIC-IV dataset. Using cTAKES to structure clinical narratives, models were assessed across three experimental runs, with majority voting determining final predictions. Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs 68%) and F1 score (67% vs 60%), with Gemini 2.0 Flash Thinking achieving the highest accuracy (75%) and F1 score (76%). However, non-reasoning models demonstrated greater stability (91% vs 84% consistency). Performance varied across ICD-10 codes, with reasoning models excelling in complex cases but struggling with abstract categories. Findings indicate a trade-off between accuracy and consistency, suggesting that a hybrid approach could optimise clinical coding. Future research should explore multi-label classification, domain-specific fine-tuning, and ensemble methods to enhance model reliability in real-world applications.
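The majority-voting step used in the study above, where each model's final prediction is the most common label across its three runs, can be sketched in one function. The labels here are illustrative.

```python
# Sketch of majority voting across repeated runs: the most frequent label
# wins; ties resolve to the label seen first (Counter preserves first-seen
# order for equal counts in Python 3.7+).

from collections import Counter

def majority_vote(predictions):
    """Return the most frequent prediction across runs."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["I10", "I10", "E11.9"]))  # "I10"
```

Voting over three runs is also what makes the consistency comparison meaningful: a model whose runs rarely agree (lower consistency) leans harder on the vote to stabilize its final prediction.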